1 Preamble

1.1 Who I am?

A professor/researcher using R for more than 10 years and who developed more than 10 R packages.

1.2 Organization of the course

The course will be organized as follows:

  • lectures,
  • pratical sessions,
  • applied projects.

2 Introduction

2.1 The softwares for Data Science

R is one of the most used softwares for Data Science:

But there are of course good alternatives:

  • Python (the SciKitLearn initiative is very good and clearly is missing in R!)
  • not free softwares: SAS, Matlab, …

2.2 Why R for Data Science?

R is very good for DS because of:

  • connected to the researchers / newest technologies
  • easy to pulish new codes
  • a community behind the software

2.3 Working with R - Lesson 101

To start a project in R:

  • you’ll need a Rstudio session
  • a blank script > open a new R script to do some serious project
  • use only the console for simple calulations
  • the plot panel allows to see graphical results

Two main commands:

  • run is devoted to execute line by line the code
  • source will execute the whole script at once

3 R basics

Note: the Base R Cheat Sheet can be downloaded on the Rstudio website: https://www.rstudio.com/resources/cheatsheets/

3.1 R as simple calculus tool

R can be used to make simple (and less simple) caclulations:

2+2
## [1] 4
exp(-2)
## [1] 0.1353353

All calculations are made using vectors (or matrices later) and the [1] is a trace of this vector nature

rnorm(15)
##  [1] -0.63614366 -1.79161413 -0.68138714 -1.11974437  0.68346521
##  [6]  0.07838463 -1.39671431  3.47389963 -0.23125699 -1.24024898
## [11]  0.85310013 -0.01432660 -2.14595092  1.27271347  1.47870982

This command produces a vector of 15 values and the number within brackets at the beginning of the lines indicate the rank of the values in the vector.

x = 2

It is possible to store the results in variables. The = operator assigns the value to the variable. When all is ok, R does not display anything!

It is also possible to use the arrow operator -> which usually works in the same way as the = operator

a = c(1,2,3,4,5)

The c() function allows to create vectors. Warning: all vectors must be of the same type (numeric, characters, booleans)!

This is a vector of booleans:

a = c(TRUE,FALSE,TRUE,FALSE,FALSE)

All operators are working on vectors elements by elements:

a = c(1,2,3,4,5)
b = c(2,1,4,6,1)
a / b
## [1] 0.5000000 2.0000000 0.7500000 0.6666667 5.0000000

The resulting vector is of the same size of the two input vectors.

Note: the operator for matrix multiplication is %*%

c = a / b
c^2
## [1]  0.2500000  4.0000000  0.5625000  0.4444444 25.0000000

Here, the expentation is again element by element.

It is usually very easy to translate Maths into code:

xbar = sum(a) / length(a)
xbar
## [1] 3

3.2 How to find a function in the documentation

The R documentation is very well structured and it is easy to find information from it. A search in the documentation is usually in two steps:

  1. Search with a key word within the documentation:
??student
  1. Access the documentation of the wanted function:
?t.test

The documentation is always organised in the same manner:

  • Description: a short description in English of the method implemented in the function
  • Usage: the way the function should be used, with all input parameters and their default values
  • Arguments: a detailed explenation of the input parameters
  • Values: a detailed explenation of the output fields

There are plenty of already coded functions:

mean(a)
## [1] 3
sd(a)
## [1] 1.581139
range(a)
## [1] 1 5
max(a)
## [1] 5

Note: missing values are represented in R with the NA type. It is possible to locate missing values using the is.na() function.

3.3 Basics of ploting

In R, the generic plot function allows to output a graphic from some object:

a = c(1,2,3,4,5)
b = c(2,1,4,6,1)
plot(b,a)

plot(b,a,pch=19,col='lightblue',cex=2,type='b')
title(main='My wonderful title')

Note: a plot is build like a Lego, meaning it is not possible to mify something whitout restarting from zero!

3.4 Loading and saving data

Two options:

  1. Data in CSV file, or similar (mydata.csv)
  2. Data in Rdata format (mydata.Rdata)

How to proceed for CSV:

# Load the data into X
X = read.csv('path/toward/thefile.csv')

# Save some data into a file
write.csv(X,file='path/toward/thefile2.csv')

How to proceed for Rdata:

# Load the data
load('path/toward/thefile.Rdata')

# Save some data into a file
save(X,Y,a,b,out,file='path/toward/thefile2.Rdata')

Note: the rm(list=ls()) allows to clean the working space.

3.5 Exercice

Write a script allowing to load a vector file and “remove” the missing values.

# Load data
x = read.csv("path/to/myfile.csv")
x = as.vector(x$x) # to ensure that x is a vector

# Count the number of NAs
prop = sum(is.na(x)) / lenght(x)

# If too many NAs, remove them
if (prop > 0.05){
  x = x[-which(is.na(x))]
}

# Otherwise, replace NAs with the mean
if (prop <= 0.05){
  m = mean(x,na.rm = TRUE)
  x[which(is.na(x))] = m
}

4 Advanced R

4.1 Objects and expressions

First, it important to understand that R has two kinds of “objects”:

  • expresssions: operators and functions,
  • objects: vectors, matrices, lists, dataframes, …

To differentiate between functions and objects, functions have to be called every time with brackets ()!

plot = c(1,1,2,3,4,5)
plot(plot)

About the functions, the arguments (imput parameters) can be provided in any order:

x = c(1,2,3,4,5,6)
y = sin(x)
plot(x,y,type='b')

or

plot(y=y,x=x,col='red',type='b')

if you declare clearly which argument you are providing!

4.2 Vectors

Vectors can be created with the c() function:

x = c(1,2,3,4,5)
x = c(TRUE,FALSE,TRUE)
x = c('tyty',"tutu",'yuyu')

or some other ways:

x = seq(1,10)
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x = seq(1,11,by = 1.5)
x
## [1]  1.0  2.5  4.0  5.5  7.0  8.5 10.0
x = seq(1,10,length.out = 100)
x
##   [1]  1.000000  1.090909  1.181818  1.272727  1.363636  1.454545  1.545455
##   [8]  1.636364  1.727273  1.818182  1.909091  2.000000  2.090909  2.181818
##  [15]  2.272727  2.363636  2.454545  2.545455  2.636364  2.727273  2.818182
##  [22]  2.909091  3.000000  3.090909  3.181818  3.272727  3.363636  3.454545
##  [29]  3.545455  3.636364  3.727273  3.818182  3.909091  4.000000  4.090909
##  [36]  4.181818  4.272727  4.363636  4.454545  4.545455  4.636364  4.727273
##  [43]  4.818182  4.909091  5.000000  5.090909  5.181818  5.272727  5.363636
##  [50]  5.454545  5.545455  5.636364  5.727273  5.818182  5.909091  6.000000
##  [57]  6.090909  6.181818  6.272727  6.363636  6.454545  6.545455  6.636364
##  [64]  6.727273  6.818182  6.909091  7.000000  7.090909  7.181818  7.272727
##  [71]  7.363636  7.454545  7.545455  7.636364  7.727273  7.818182  7.909091
##  [78]  8.000000  8.090909  8.181818  8.272727  8.363636  8.454545  8.545455
##  [85]  8.636364  8.727273  8.818182  8.909091  9.000000  9.090909  9.181818
##  [92]  9.272727  9.363636  9.454545  9.545455  9.636364  9.727273  9.818182
##  [99]  9.909091 10.000000

This would be very useful to get a nice plot of the sinus function.

x = seq(-pi,pi,length.out = 100)
plot(x,sin(x),type='l')

You can also use the : operator which equivalent to the default seq() function.

x = seq(1:10)
x
##  [1]  1  2  3  4  5  6  7  8  9 10
y = 1:10
y
##  [1]  1  2  3  4  5  6  7  8  9 10
w = 0.5:10
w
##  [1] 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5

And the last way of creating vectors is to use the rep function.

x = rep(0,10)
x
##  [1] 0 0 0 0 0 0 0 0 0 0

This is useful in particular to initialize a vector which will be used in a loop or a function.

Tip: always initialize your vectors with a “strange” value for the task (for instance NA’s)

m = rep(NA,10)
for (i in 1:10) m[i] = mean(rnorm(10))
m
##  [1] -0.40058418 -0.32261605 -0.12701528 -0.02434036 -0.29617637
##  [6] -0.52112056 -0.23050035  0.24287414  0.10853839  0.58870651

4.3 Matrices

Matrices can be created with the matrix function:

A = matrix(c(1,2,3,4,5,6),nrow = 2,byrow = TRUE)
A
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

Another option would be to use the cbind and rdind functions:

A = cbind(c(1,2),c(2,3),c(4,5))
A
##      [,1] [,2] [,3]
## [1,]    1    2    4
## [2,]    2    3    5
A = rbind(c(1,2,3),c(4,5,6))
A
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

To access vectors and matrices, we cannot use the barckets ()! Instead, we use the square brackets [...]

x = seq(1,10,length.out = 25)
# Access the 5th element of x
x[5]
## [1] 2.5
# Access all elements between the 5th and the 15th elements
x[5:15]
##  [1] 2.500 2.875 3.250 3.625 4.000 4.375 4.750 5.125 5.500 5.875 6.250
# Access all elements except those between 5 and 15
y = x[-(5:15)]
y
##  [1]  1.000  1.375  1.750  2.125  6.625  7.000  7.375  7.750  8.125  8.500
## [11]  8.875  9.250  9.625 10.000
length(y)
## [1] 14

With the square brackets, you can also modify the values of a vector:

x[5] = NA
x
##  [1]  1.000  1.375  1.750  2.125     NA  2.875  3.250  3.625  4.000  4.375
## [11]  4.750  5.125  5.500  5.875  6.250  6.625  7.000  7.375  7.750  8.125
## [21]  8.500  8.875  9.250  9.625 10.000

It is also possible to do some conditional access:

x[is.na(x)] = -1
x
##  [1]  1.000  1.375  1.750  2.125 -1.000  2.875  3.250  3.625  4.000  4.375
## [11]  4.750  5.125  5.500  5.875  6.250  6.625  7.000  7.375  7.750  8.125
## [21]  8.500  8.875  9.250  9.625 10.000
x[x < 4] = 0
x
##  [1]  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  4.000  4.375
## [11]  4.750  5.125  5.500  5.875  6.250  6.625  7.000  7.375  7.750  8.125
## [21]  8.500  8.875  9.250  9.625 10.000
x[x < 4 | x > 8] = NA
x
##  [1]    NA    NA    NA    NA    NA    NA    NA    NA 4.000 4.375 4.750
## [12] 5.125 5.500 5.875 6.250 6.625 7.000 7.375 7.750    NA    NA    NA
## [23]    NA    NA    NA

Note: the | operator denotes a OR. The AND is denoted by the & operator.

For matrices, you will have to provide position information about rows and columns:

A = diag(1,5)
A
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    1    0    0    0
## [3,]    0    0    1    0    0
## [4,]    0    0    0    1    0
## [5,]    0    0    0    0    1
A[1:2,4:5]
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0

If you would like to access all rows (resp. columns) of a matrix, you can use the “nothing” operator:

A[,4:5]
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0
## [3,]    0    0
## [4,]    1    0
## [5,]    0    1

In some cases, you perhaps would like to access some specific locations:

A[c(1,3),c(2,5)]
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0

4.4 Lists

Lists are objects able to store information of different types. All functions return lists, so it is important to be able to manipulate them!

mylist = list(name='Charles',age=28,married=TRUE)
mylist
## $name
## [1] "Charles"
## 
## $age
## [1] 28
## 
## $married
## [1] TRUE
mylist$name = 'Tutu'
mylist
## $name
## [1] "Tutu"
## 
## $age
## [1] 28
## 
## $married
## [1] TRUE
mylist$Nb_of_cats = 4
mylist
## $name
## [1] "Tutu"
## 
## $age
## [1] 28
## 
## $married
## [1] TRUE
## 
## $Nb_of_cats
## [1] 4
out = t.test(1:10, y = c(7:20))
out
## 
##  Welch Two Sample t-test
## 
## data:  1:10 and c(7:20)
## t = -5.4349, df = 21.982, p-value = 1.855e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.052802  -4.947198
## sample estimates:
## mean of x mean of y 
##       5.5      13.5
print.default(out)
## $statistic
##        t 
## -5.43493 
## 
## $parameter
##       df 
## 21.98221 
## 
## $p.value
## [1] 1.855282e-05
## 
## $conf.int
## [1] -11.052802  -4.947198
## attr(,"conf.level")
## [1] 0.95
## 
## $estimate
## mean of x mean of y 
##       5.5      13.5 
## 
## $null.value
## difference in means 
##                   0 
## 
## $alternative
## [1] "two.sided"
## 
## $method
## [1] "Welch Two Sample t-test"
## 
## $data.name
## [1] "1:10 and c(7:20)"
## 
## attr(,"class")
## [1] "htest"
out$statistic
##        t 
## -5.43493
sqrt(abs(out$statistic))
##        t 
## 2.331294
out$p.value
## [1] 1.855282e-05

4.5 Data frames

The last object is the data.frame and it is devoted to store data sets. In a data.frame, rows are for individuals and columns are for variables.

# We create a data.frame with 28 individuals and 3 variables (2 numeric and 1 categorical)
X = data.frame(x = seq(1,10,length.out = 28),
               y = seq(-3,3,length.out = 28),
               sex = c(rep('M',10),rep('F',18)))
X
##            x          y sex
## 1   1.000000 -3.0000000   M
## 2   1.333333 -2.7777778   M
## 3   1.666667 -2.5555556   M
## 4   2.000000 -2.3333333   M
## 5   2.333333 -2.1111111   M
## 6   2.666667 -1.8888889   M
## 7   3.000000 -1.6666667   M
## 8   3.333333 -1.4444444   M
## 9   3.666667 -1.2222222   M
## 10  4.000000 -1.0000000   M
## 11  4.333333 -0.7777778   F
## 12  4.666667 -0.5555556   F
## 13  5.000000 -0.3333333   F
## 14  5.333333 -0.1111111   F
## 15  5.666667  0.1111111   F
## 16  6.000000  0.3333333   F
## 17  6.333333  0.5555556   F
## 18  6.666667  0.7777778   F
## 19  7.000000  1.0000000   F
## 20  7.333333  1.2222222   F
## 21  7.666667  1.4444444   F
## 22  8.000000  1.6666667   F
## 23  8.333333  1.8888889   F
## 24  8.666667  2.1111111   F
## 25  9.000000  2.3333333   F
## 26  9.333333  2.5555556   F
## 27  9.666667  2.7777778   F
## 28 10.000000  3.0000000   F

Data.frame can be used as matrices or lists:

X[5:10,]
##           x         y sex
## 5  2.333333 -2.111111   M
## 6  2.666667 -1.888889   M
## 7  3.000000 -1.666667   M
## 8  3.333333 -1.444444   M
## 9  3.666667 -1.222222   M
## 10 4.000000 -1.000000   M
X$sex
##  [1] M M M M M M M M M M F F F F F F F F F F F F F F F F F F
## Levels: F M
X[,3]
##  [1] M M M M M M M M M M F F F F F F F F F F F F F F F F F F
## Levels: F M
summary(X)
##        x               y        sex   
##  Min.   : 1.00   Min.   :-3.0   F:18  
##  1st Qu.: 3.25   1st Qu.:-1.5   M:10  
##  Median : 5.50   Median : 0.0         
##  Mean   : 5.50   Mean   : 0.0         
##  3rd Qu.: 7.75   3rd Qu.: 1.5         
##  Max.   :10.00   Max.   : 3.0

Some useful functions to get information about objects:

#Length of a vector
length(x)
## [1] 25
# Length of a list (nb of fields)
length(mylist)
## [1] 4
# For matrices and data.frames
nrow(X)
## [1] 28
ncol(X)
## [1] 3
# For data.frames
colnames(X)
## [1] "x"   "y"   "sex"
rownames(X)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"

It is also possible to modify the names of variables and individuals

colnames(X) = c('Var 1','Var 2','Var 3')
X
##        Var 1      Var 2 Var 3
## 1   1.000000 -3.0000000     M
## 2   1.333333 -2.7777778     M
## 3   1.666667 -2.5555556     M
## 4   2.000000 -2.3333333     M
## 5   2.333333 -2.1111111     M
## 6   2.666667 -1.8888889     M
## 7   3.000000 -1.6666667     M
## 8   3.333333 -1.4444444     M
## 9   3.666667 -1.2222222     M
## 10  4.000000 -1.0000000     M
## 11  4.333333 -0.7777778     F
## 12  4.666667 -0.5555556     F
## 13  5.000000 -0.3333333     F
## 14  5.333333 -0.1111111     F
## 15  5.666667  0.1111111     F
## 16  6.000000  0.3333333     F
## 17  6.333333  0.5555556     F
## 18  6.666667  0.7777778     F
## 19  7.000000  1.0000000     F
## 20  7.333333  1.2222222     F
## 21  7.666667  1.4444444     F
## 22  8.000000  1.6666667     F
## 23  8.333333  1.8888889     F
## 24  8.666667  2.1111111     F
## 25  9.000000  2.3333333     F
## 26  9.333333  2.5555556     F
## 27  9.666667  2.7777778     F
## 28 10.000000  3.0000000     F
rownames(X) = paste('Individual',1:28)
X
##                   Var 1      Var 2 Var 3
## Individual 1   1.000000 -3.0000000     M
## Individual 2   1.333333 -2.7777778     M
## Individual 3   1.666667 -2.5555556     M
## Individual 4   2.000000 -2.3333333     M
## Individual 5   2.333333 -2.1111111     M
## Individual 6   2.666667 -1.8888889     M
## Individual 7   3.000000 -1.6666667     M
## Individual 8   3.333333 -1.4444444     M
## Individual 9   3.666667 -1.2222222     M
## Individual 10  4.000000 -1.0000000     M
## Individual 11  4.333333 -0.7777778     F
## Individual 12  4.666667 -0.5555556     F
## Individual 13  5.000000 -0.3333333     F
## Individual 14  5.333333 -0.1111111     F
## Individual 15  5.666667  0.1111111     F
## Individual 16  6.000000  0.3333333     F
## Individual 17  6.333333  0.5555556     F
## Individual 18  6.666667  0.7777778     F
## Individual 19  7.000000  1.0000000     F
## Individual 20  7.333333  1.2222222     F
## Individual 21  7.666667  1.4444444     F
## Individual 22  8.000000  1.6666667     F
## Individual 23  8.333333  1.8888889     F
## Individual 24  8.666667  2.1111111     F
## Individual 25  9.000000  2.3333333     F
## Individual 26  9.333333  2.5555556     F
## Individual 27  9.666667  2.7777778     F
## Individual 28 10.000000  3.0000000     F
X["Individual 26",]
##                  Var 1    Var 2 Var 3
## Individual 26 9.333333 2.555556     F

4.6 Some exercises

Exercise: Create two vectors of different dimensions and insert the second one in the first one between the 2nd and 3rd elements.

a = c(1,2,3,4,5)
b = runif(10)
c = c(a[1:2],b,a[-(1:2)])
c
##  [1] 1.00000000 2.00000000 0.48180683 0.13411387 0.42401444 0.96252263
##  [7] 0.37654661 0.73015810 0.86870886 0.04555321 0.74810374 0.38847350
## [13] 3.00000000 4.00000000 5.00000000

Exercise: Draw 100 numbers from a Uniform distribution on [0,1] and count how many values are larger than 0.5

x = runif(100,min = 0,max = 1)
sum(x>=0.5)
## [1] 50
length(x[x>=0.5])
## [1] 50

Exercise: Draw 100 couples \((x,y)\) randomly in the square \([0,1]^2\) (use a data.frame to store the data). Count how many couples are within the square \([0.5,1]^2\). Plot the couples and display with a red point the mean of all points.

X = data.frame(x = runif(100), y = runif(100))
sum(X$x >= 0.5 & X$y >= 0.5)
## [1] 24
plot(X,col='lightblue',pch=19)
mean = colMeans(X)
points(mean[1],mean[2],col='red',pch=19,cex=2)

Exercise: It is Spring! We consider a field of flower bulbes. We have 100 bulbes in a field of 1m x 1m. We know that the probability that a bulbe blooms is 0.85 (use a data.frame to store the data). Simulate the blossoming of the bulbes!

X = data.frame(x = runif(100), y = runif(100),
               b = rbinom(100,size = 1,prob = 0.85))
X
##               x          y b
## 1   0.482385463 0.95272036 0
## 2   0.935273987 0.48938969 1
## 3   0.833426535 0.36876173 1
## 4   0.491739545 0.21967751 1
## 5   0.512224207 0.22175255 1
## 6   0.169196047 0.81228363 1
## 7   0.798336988 0.41465905 1
## 8   0.915688189 0.52225389 1
## 9   0.901797233 0.97200626 1
## 10  0.594102430 0.39050110 1
## 11  0.709332747 0.17575912 1
## 12  0.668688073 0.74051950 1
## 13  0.190941632 0.92416120 1
## 14  0.123592664 0.55418842 1
## 15  0.270830068 0.89318131 1
## 16  0.808378143 0.62003766 1
## 17  0.599745586 0.68595147 1
## 18  0.827772828 0.93299986 1
## 19  0.794061270 0.61640828 1
## 20  0.651917435 0.19099553 1
## 21  0.154867405 0.42605300 0
## 22  0.759520175 0.76551816 1
## 23  0.746298232 0.16841549 1
## 24  0.738623844 0.04501963 1
## 25  0.104805069 0.96593480 1
## 26  0.854276683 0.92916019 1
## 27  0.547584553 0.08269657 1
## 28  0.014208163 0.23111107 1
## 29  0.002305876 0.97891846 1
## 30  0.311017394 0.09976388 1
## 31  0.136308048 0.68807202 1
## 32  0.515896664 0.67555061 1
## 33  0.454988125 0.63667950 1
## 34  0.437051728 0.90632182 1
## 35  0.166306296 0.29200164 1
## 36  0.743982983 0.87407347 1
## 37  0.822936576 0.95302615 1
## 38  0.359468095 0.26662019 1
## 39  0.785042912 0.01828725 1
## 40  0.395148351 0.06564920 1
## 41  0.690722077 0.92359795 1
## 42  0.797622001 0.93304227 1
## 43  0.717520420 0.79626286 1
## 44  0.061249074 0.75552193 1
## 45  0.658153805 0.98950556 1
## 46  0.475078070 0.51126781 1
## 47  0.484872019 0.69078698 1
## 48  0.366873633 0.28560776 1
## 49  0.214307128 0.15989784 1
## 50  0.225645179 0.38124649 1
## 51  0.584252419 0.48223083 1
## 52  0.558898341 0.58456680 1
## 53  0.650039612 0.73270031 0
## 54  0.499680690 0.69274960 0
## 55  0.880561361 0.52353401 1
## 56  0.590076074 0.42514644 1
## 57  0.265238620 0.03443015 1
## 58  0.217237896 0.49831299 1
## 59  0.669885437 0.39391366 1
## 60  0.763745511 0.36414456 1
## 61  0.425271056 0.78846814 1
## 62  0.662841910 0.73856696 1
## 63  0.384085254 0.55445714 1
## 64  0.558712669 0.75126621 1
## 65  0.879896356 0.45663479 1
## 66  0.130927367 0.48334692 1
## 67  0.553595135 0.50196796 1
## 68  0.680771236 0.05204429 1
## 69  0.127966061 0.37342640 1
## 70  0.739712183 0.64602212 1
## 71  0.338359063 0.21828268 1
## 72  0.343835741 0.03412282 1
## 73  0.307959618 0.84937405 0
## 74  0.320364905 0.49873904 1
## 75  0.379980090 0.22769280 1
## 76  0.812421255 0.90316739 1
## 77  0.858120196 0.28602949 0
## 78  0.703462673 0.68164390 1
## 79  0.937758762 0.56573134 1
## 80  0.391415598 0.12947586 1
## 81  0.101512408 0.17192509 1
## 82  0.792423530 0.17843453 1
## 83  0.936878131 0.39919860 1
## 84  0.413874270 0.76720697 1
## 85  0.987793410 0.39508023 1
## 86  0.946944396 0.25834206 0
## 87  0.728985428 0.56984491 1
## 88  0.835279047 0.51134936 1
## 89  0.522861788 0.62924206 1
## 90  0.650369765 0.79743108 1
## 91  0.875004860 0.65072493 1
## 92  0.794561896 0.04576638 1
## 93  0.380683979 0.01460950 1
## 94  0.335860675 0.56088856 1
## 95  0.682850689 0.12609909 1
## 96  0.973856049 0.00358598 1
## 97  0.299150000 0.12330233 1
## 98  0.278790976 0.91354638 1
## 99  0.146340450 0.54135608 1
## 100 0.491211016 0.02464513 1
plot(X[,1:2],pch=19,col=X$b+1)
nb = sum(X$b == 1)
title(main = paste(nb,"bulbes become flowers!"))
legend("bottomleft",legend = c('Failed','Bloomed'),col=c(1,2),pch=19)

5 The graphical system

5.1 Basic plots and legends

The graphical system is very complex and works in Lego fashion, i.e. it is not possible to remove something on plot wihout clearing everything!

Of course, it is possible to add on a figure texts, legend, title, etc … and to save the figure in an image format (png, jpg, pdf, eps, …)

The window of the plot is automatically set up according to the data scale.

x = runif(100)
y = rnorm(100)
plot(x,y,pch=19)

It is possible to resize the window by using the xlim and ylim arguments.

plot(x,y,pch=19,xlim=c(-1,2),ylim=c(-3,4))
points(1.5,0,col='red')
points(0,3,col='blue')

plot(x[x >= 0.5],y[x >= 0.5],pch=19,
     xlab='x',ylab='y',main='My title')
title(sub='My subtitle')
legend("bottomright",legend = 'my points',col=1,pch=19)
text(x,y,1:length(x >= 0.5),pos = 1)

To draw some line (for instance for linear regression), we use the abline function:

plot(x,x+rnorm(100,0,0.2),pch=19,ylab='y')
abline(a=0,b=1,lty=2,lwd=2,col='red')
abline(v = 0.5,lty=3,col='blue')

It is also possible to add texts in the margin:

plot(x,x+rnorm(100,0,0.2),pch=19,ylab='y')
abline(a=0,b=1,lty=2,lwd=2,col='red')
abline(v = 0.5,lty=3,col='blue')
mtext("A nice linear model",side = 4)

5.2 Basic statistics with R plots

Histograms are a way to look at the distribution of the data.

x = rnorm(200,mean = 2,sd = 1.5)
hist(x,col='lavender')

Be careful with histograms: it is not so easy to draw a great histogram! In addition, keep in mind that it is possible to manipulate a histogram to pass a specific message!

For example, this is the default result for a uniformly distributed variable:

x = runif(100,0,1)
hist(x)

And this are other possible histograms for the same data:

par(mfrow=c(1,2))
hist(x,breaks=2)
hist(x,breaks=100)

It is possible to combine or replace the histogram with a plot o the estimated density function. The function to estimate the density function is density:

x = c(rnorm(100,2,1),rgamma(50,shape = 1))
hist(x,freq = FALSE)
f = density(x)
lines(f,lwd=2,col='red',lty=2)

It is possible to compare the empirical distribution of the data with the theorical function of some specific distribution:

x = c(rnorm(100,2,1),rgamma(50,shape = 1))
qqnorm(x)

x = c(rnorm(100,2,1),rgamma(50,shape = 1))
qqplot(x,y = rgamma(1000,shape=1))

Perhaps, the best way to look at the distribution of some data is to do a boxplot. Boxplots are great becaus the recipie is unique!

par(mfrow=c(1,2))
boxplot(rnorm(100),title='Normal')
boxplot(runif(100),title='Uniform')

par(mfrow=c(1,2))
boxplot(rnorm(100),title='Normal')
boxplot(rgamma(100,shape=1),title='Gamma')

But, it is a very simple plot and it is not easy to look at complex distributions:

x = c(rnorm(100,2,1),rgamma(50,shape = 1))
boxplot(x,title='Mixture dist.')

Interestingly, it is possible to display multavariate data with boxplots:

x = iris[,-5]
boxplot(x)

This is very helpful but limited to data on the same unit!

To represent multivariate data, we also have the pairplot funtion.

x = iris[,-5]
pairs(x)

Remark: Keep in mind that some information can remain hidden in high-dimensional spaces!

It is possible to add categorical variables on a plot by changing the colors or the forms of the points.

x = iris[,-5]
y = iris$Species
pairs(x,col=as.numeric(y),pch=as.numeric(y))

Finally, for categorical variables, you can draw pie plots:

y = iris$Species
pie(summary(y))

5.3 The ggplot package

We are going to use an extra package, ggplot, that can be download on the CRAN repository.

We are going to use the mpg data set to illustrate the use of the package:

library(ggplot2)
?mpg
mpg
## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    cla~
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <ch>
##  1 audi         a4      1.8  1999     4 auto~ f        18    29 p     com~
##  2 audi         a4      1.8  1999     4 manu~ f        21    29 p     com~
##  3 audi         a4      2    2008     4 manu~ f        20    31 p     com~
##  4 audi         a4      2    2008     4 auto~ f        21    30 p     com~
##  5 audi         a4      2.8  1999     6 auto~ f        16    26 p     com~
##  6 audi         a4      2.8  1999     6 manu~ f        18    26 p     com~
##  7 audi         a4      3.1  2008     6 auto~ f        18    27 p     com~
##  8 audi         a4 q~   1.8  1999     4 manu~ 4        18    26 p     com~
##  9 audi         a4 q~   1.8  1999     4 auto~ 4        16    25 p     com~
## 10 audi         a4 q~   2    2008     4 manu~ 4        20    28 p     com~
## # ... with 224 more rows

The ggplot package has a very specific wayt of working:

ggplot(data = mpg) + geom_histogram(aes(x = hwy))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Notice that geom_histogram always use 30 bins for making the plot! It is necessary to provide a better value if you want!

ggplot(data = mpg) + geom_histogram(aes(x = hwy), bins=50, fill = 'pink')

ggplot(data = mpg) + geom_histogram(aes(x = hwy, y = ..density..), bins=50, fill = 'pink') + geom_density(aes(x = hwy),col = 'green')

It is also possible to draw boxplots:

boxplot(mpg$hwy)

ggplot(data = mpg) + geom_boxplot(aes(x = '',y = hwy), fill = 'lightblue')

Let’s compare the distribution of the hwy variable with one normal distribution with the quantile-quantile plot (qqplot for friends!):

ggplot(data = mpg) +
  geom_qq(aes(sample = hwy)) +
  geom_qq_line(aes(sample = hwy))

ggplot(data = mpg) +
  geom_qq(aes(sample = hwy),distribution = stats::qgamma,dparams = (shape=1)) + 
    geom_qq_line(aes(sample = hwy),distribution = stats::qgamma,dparams = (shape=1))

For categorical data, we also have some barplots:

ggplot(data = mpg) + geom_bar(aes(x = trans))

It is also possible to have some original graphics that are not easy to obtain with the usual plot function, like this one:

ggplot(data = mpg) + geom_bar(aes(x = "", fill = factor(trans)))

To draw a barplot:

ggplot(data = mpg) + geom_bar(aes(x = "", fill = factor(trans))) + coord_polar(theta = "y")

For two-dimensional data, we have also specific plots:

ggplot(data = mpg) + geom_point(aes(x = displ,y = hwy))

ggplot(data = mpg) + geom_point(aes(x = displ,y = hwy,col = class, shape = factor(year)))

It is also possible to display the data conditionnaly on one categorical variable (here the class variable):

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, col = class)) + facet_wrap( ~ class, nrow = 3)

It is even possible to make the plots conditionnaly on two categorical variables:

ggplot(data = mpg) + geom_point(aes(x = displ, y = hwy)) + facet_grid(drv ~ cyl)

Let’s try to do the same without ggplot:

par(mfrow=c(3,4))
for (m in unique(mpg$drv))
  for (v in unique(mpg$cyl)){
    sel = which(mpg$drv == m & mpg$cyl == v)
    if (length(sel)>0) plot(mpg$displ[sel],mpg$hwy[sel],pch=19)
    else plot(0,0,type='n')
  }

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(mapping = aes(x = displ, y = hwy),method = 'lm')

If one wuld like to interpret the coefficients of the lm model, it is necessary to use directrly the lm function:

out = lm(mpg$hwy ~ 1 + mpg$displ)
out
## 
## Call:
## lm(formula = mpg$hwy ~ 1 + mpg$displ)
## 
## Coefficients:
## (Intercept)    mpg$displ  
##      35.698       -3.531
plot(out)

plot( mpg$displ, mpg$hwy, type='p')
abline(out, col='red')

It is worth noticing that the ggplot graphic is more meaningful regarding the distribution of the residuals.

When working with multivariate categorical data, it still possible to have nice plots!

table(mpg$cyl,mpg$drv)
##    
##      4  f  r
##   4 23 58  0
##   5  0  4  0
##   6 32 43  4
##   8 48  1 21
ggplot(data = mpg) + 
  geom_bar(aes(x = factor(cyl), fill = factor(drv)))

ggplot(data = mpg) + 
  geom_bar(aes(x = factor(drv), fill = factor(cyl)))

There are variants of this plot, that can be obtained using the position option.

ggplot(data = mpg) + geom_bar(aes(x = factor(cyl), fill = factor(drv)), position = "dodge")

ggplot(data = mpg) + geom_bar(aes(x = factor(cyl), fill = factor(drv)), position = "fill")

ggplot(data = mpg) + 
  geom_bar(aes(x = "", fill = factor(cyl)), position = "fill") + 
  coord_polar(theta = "y") +
  facet_wrap(~ factor(drv))

Remark: in the two last plots, information about the counts are missing and this could be dangerous when intyerpreting…

tapply(mpg$displ, mpg$cyl, mean)
##        4        5        6        8 
## 2.145679 2.500000 3.408861 5.132857
ggplot(data = mpg) + geom_boxplot(aes(x = factor(cyl), y = displ))

ggplot(data = mpg) + geom_histogram(aes(x = displ)) + facet_wrap(~ factor(cyl), ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = mpg) + geom_density(aes(x = displ, col = factor(cyl)))

Exercise: Use the starwars data set in thedplyr` package to:

  • list the different human characters,
  • list the different worlds,
  • compute the average weight and height of the different character types,
  • display on a plot the number of characters of each type in a deacresing order,
  • visualize the relationship between the height and weight of the different characters.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
?starwars
starwars$name[starwars$species == 'Human']
##  [1] "Luke Skywalker"      "Darth Vader"         "Leia Organa"        
##  [4] "Owen Lars"           "Beru Whitesun lars"  "Biggs Darklighter"  
##  [7] "Obi-Wan Kenobi"      "Anakin Skywalker"    "Wilhuff Tarkin"     
## [10] "Han Solo"            "Wedge Antilles"      "Jek Tono Porkins"   
## [13] "Palpatine"           "Boba Fett"           "Lando Calrissian"   
## [16] "Lobot"               "Mon Mothma"          "Arvel Crynyd"       
## [19] "Qui-Gon Jinn"        "Finis Valorum"       NA                   
## [22] NA                    "Shmi Skywalker"      "Mace Windu"         
## [25] "Gregar Typho"        "Cord<U+00E9>"        "Cliegg Lars"        
## [28] "Dorm<U+00E9>"        "Dooku"               "Bail Prestor Organa"
## [31] "Jango Fett"          "Jocasta Nu"          NA                   
## [34] "Raymus Antilles"     NA                    "Finn"               
## [37] "Rey"                 "Poe Dameron"         NA                   
## [40] "Padm<U+00E9> Amidala"
unique(starwars$homeworld)
##  [1] "Tatooine"       "Naboo"          "Alderaan"       "Stewjon"       
##  [5] "Eriadu"         "Kashyyyk"       "Corellia"       "Rodia"         
##  [9] "Nal Hutta"      "Bestine IV"     NA               "Kamino"        
## [13] "Trandosha"      "Socorro"        "Bespin"         "Mon Cala"      
## [17] "Chandrila"      "Endor"          "Sullust"        "Cato Neimoidia"
## [21] "Coruscant"      "Toydaria"       "Malastare"      "Dathomir"      
## [25] "Ryloth"         "Vulpter"        "Troiken"        "Tund"          
## [29] "Haruun Kal"     "Cerea"          "Glee Anselm"    "Iridonia"      
## [33] "Iktotch"        "Quermia"        "Dorin"          "Champala"      
## [37] "Geonosis"       "Mirial"         "Serenno"        "Concord Dawn"  
## [41] "Zolan"          "Ojom"           "Aleen Minor"    "Skako"         
## [45] "Muunilinst"     "Shili"          "Kalee"          "Umbara"        
## [49] "Utapau"
#levels(as.factor(starwars$homeworld))
weight = tapply(starwars$mass, starwars$species, mean)
height = tapply(starwars$height, starwars$species, mean)
Species = data.frame(weight,height)
Species
##                weight   height
## Aleena           15.0  79.0000
## Besalisk        102.0 198.0000
## Cerean           82.0 198.0000
## Chagrian           NA 196.0000
## Clawdite         55.0 168.0000
## Droid              NA       NA
## Dug              40.0 112.0000
## Ewok             20.0  88.0000
## Geonosian        80.0 183.0000
## Gungan             NA 208.6667
## Human              NA       NA
## Hutt           1358.0 175.0000
## Iktotchi           NA 188.0000
## Kaleesh         159.0 216.0000
## Kaminoan           NA 221.0000
## Kel Dor          80.0 188.0000
## Mirialan         53.1 168.0000
## Mon Calamari     83.0 180.0000
## Muun               NA 191.0000
## Nautolan         87.0 196.0000
## Neimodian        90.0 191.0000
## Pau'an           80.0 206.0000
## Quermian           NA 264.0000
## Rodian           74.0 173.0000
## Skakoan          48.0 193.0000
## Sullustan        68.0 160.0000
## Tholothian       50.0 184.0000
## Togruta          57.0 178.0000
## Toong            65.0 163.0000
## Toydarian          NA 137.0000
## Trandoshan      113.0 190.0000
## Twi'lek            NA 179.0000
## Vulptereen       45.0  94.0000
## Wookiee         124.0 231.0000
## Xexto              NA 122.0000
## Yoda's species   17.0  66.0000
## Zabrak             NA 173.0000
ggplot(data = Species) + geom_point(aes(x = weight,y = height)) +
  geom_text(aes(x = weight,y = height,label = rownames(Species)),nudge_y = 10)
## Warning: Removed 12 rows containing missing values (geom_point).
## Warning: Removed 12 rows containing missing values (geom_text).

Species$count = table(starwars$species)
Species = Species[order(Species$count,decreasing = TRUE),]
#ggplot(data = Species) + geom_bar(aes(x = count))
barplot(Species$count)

ggplot(data = starwars) +
  geom_point(aes(x = mass,y = height)) +
  geom_smooth(aes(x = mass,y = height),method = 'lm')
## Warning: Removed 28 rows containing non-finite values (stat_smooth).
## Warning: Removed 28 rows containing missing values (geom_point).

Remark: we can see here that the linear model is very sensitive to ouliers!

6 Programming with R

6.1 Basics of R programming

6.2 Advanced tools

6.3 Building R packages

7 Reporting with R

7.1 Rmarkdown

7.2 Shiny

8 Handling massive data sets with dplyr